Here’s a link to a short survey:
https://forms.gle/jEtCucpW3W6Fjr1u9
If you want to follow along with the analysis, you can copy-paste code from here:
Identify features of a variable
Recognize the level of measurement
Describe central tendency
Describe dispersion
Definition: identifying concrete features of the objects or events we’re studying and the tools to measure them.
Measurement: representing real-life objects and events as variables
Description: making generalizations about variables
The results of the measurement process.
By definition a variable must vary
How we describe a variable will depend on how we measure it.
3 broad “types” of variables, each with slightly different properties:
Nominal
Ordinal
Interval
The region of a particular U.S. state is a nominal variable: there’s a fixed number of categories and no intrinsic ordering
Ordinal variables have a small number of categories that can be ordered.
However, the gaps between differing ranks may be unequal.
Top finishing times for Boston Marathon in 2024. The placements are ordinal: the distance in time between first and second place doesn’t necessarily equal the distance between second and third.
A common source of ordinal variables will be survey items that ask people to rate their position on a scale from “strongly agree” to “strongly disagree” or “very important” to “not at all important”.
ANES question from 2020 about the importance of people agreeing about basic facts
Another source of ordinal variables might be data that we’ve grouped into ordered categories based on another variable.
For instance: the World Bank classifies countries into four ordinal categories based on their per-capita gross national income (GNI). There’s a clear ordering, but the gaps between categories are not equal.
Two things to keep in mind when working with ordinal variables:
The “ordering” may depend partly on your research question. You could flip these or combine categories to arrange them from “most partisan” to “least partisan”.
Some survey variables may only be ordinal after you remove all the people who gave “don’t know” responses.
These are just numbers. They’re measured along a continuum with equal spacing (i.e., the difference between 3 and 4 is “the same” as the difference between 6 and 7).
Examples: age, height, temperature, distance.
“True” interval variables are less common in survey research, but we’ll often treat ordinal variables as “more-or-less interval” if they have a lot (7+) of categories.
Strictly speaking these “feeling thermometer” responses are more like ordinal variables. But they’re close enough for us to treat them like interval variables for most purposes.
Dichotomous variables take on only two values: TRUE/FALSE, Republican/Democrat, War/Peace, etc.
“Dummy” variables encode this dichotomy as 0s and 1s, which can simplify some math operations.
Dummy coding is important for statistical modeling because it’s how we turn nominal data into meaningful numeric data:
| Who did you vote for in 2020? | Trump | Biden | Stein |
|---|---|---|---|
| Trump | 1 | 0 | 0 |
| Biden | 0 | 1 | 0 |
| Stein | 0 | 0 | 1 |
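The table above can be reproduced with a few lines of code. This is a minimal sketch in Python (standard library only); the `votes` list and the `dummy_code` helper are made up for illustration, not taken from the survey data.

```python
# Hypothetical responses to "Who did you vote for in 2020?"
votes = ["Trump", "Biden", "Stein", "Biden"]

# One dummy (0/1) column per category, in a fixed order
categories = ["Trump", "Biden", "Stein"]


def dummy_code(value, categories):
    """Return a list of 0/1 indicators, one per category."""
    return [1 if value == category else 0 for category in categories]


dummies = [dummy_code(vote, categories) for vote in votes]

for vote, row in zip(votes, dummies):
    print(vote, row)
# Trump [1, 0, 0]
# Biden [0, 1, 0]
# Stein [0, 0, 1]
# Biden [0, 1, 0]
```

Each respondent gets exactly one 1, in the column that matches their answer.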
Some surveys may measure an attitude by asking multiple questions on the same topic and then aggregating those responses. These aggregate indexes are often treated as interval-level variables.
The values in bold indicate more “authoritarian” attitudes toward child rearing.
| question | value | percent |
|---|---|---|
| Considerate vs. Well-behaved | Being considerate | 72% |
| | **Well behaved** | 28% |
| Curiosity vs. Good Manners | Curiosity | 39% |
| | **Good manners** | 61% |
| Self Reliance vs. Obedience | **Obedience** | 44% |
| | Self-reliance | 56% |
| Independence vs. Respect for Elders | Independence | 31% |
| | **Respect for elders** | 69% |
The “authoritarianism” column is an index variable created by counting the number of “authoritarian” answers each respondent gave across the four questions.
| authoritarianism | percent | cumulative % |
|---|---|---|
| 0 | 18.3% | 18.3% |
| 1 | 17.7% | 36.0% |
| 2 | 23.2% | 59.2% |
| 3 | 25.1% | 84.2% |
| 4 | 15.8% | 100.0% |
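An additive index like this is just a count per respondent. The sketch below, in Python, assumes the “authoritarian” answers are well behaved, good manners, obedience, and respect for elders; the `respondent` dictionary is invented for illustration.

```python
# Answers treated as "authoritarian" on each of the four items
# (an assumption about how the index is coded)
AUTHORITARIAN_ANSWERS = {
    "Considerate vs. Well-behaved": "Well behaved",
    "Curiosity vs. Good Manners": "Good manners",
    "Self Reliance vs. Obedience": "Obedience",
    "Independence vs. Respect for Elders": "Respect for elders",
}


def authoritarianism_index(responses):
    """Count how many of the four items got the authoritarian answer (0 to 4)."""
    return sum(
        1
        for question, answer in AUTHORITARIAN_ANSWERS.items()
        if responses.get(question) == answer
    )


# One hypothetical respondent
respondent = {
    "Considerate vs. Well-behaved": "Being considerate",
    "Curiosity vs. Good Manners": "Good manners",
    "Self Reliance vs. Obedience": "Self-reliance",
    "Independence vs. Respect for Elders": "Respect for elders",
}

print(authoritarianism_index(respondent))  # 2
```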
It is often possible to measure the same variable at multiple levels of measurement.
“Education”, for instance, could be recorded as interval, ordinal, or dichotomous:
| Years of schooling | Highest Level of Schooling Completed | Some College |
|---|---|---|
| 9 | Less than High School | No |
| 10 | Less than High School | No |
| 12 | High School | No |
| 13 | Some College | Yes |
| 14 | Some College | Yes |
| 15 | Some College | Yes |
| 16 | Bachelor’s | Yes |
| 17 | Post Bachelor’s | Yes |
It’s often preferable to use the highest level of precision available, but sometimes we choose a less precise measure because it’s more parsimonious, less “noisy”, or easier to display in a table or graph.
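Moving from a more precise measure to a less precise one is a simple recode. Here is a rough Python sketch; the cut points are inferred from the education table above, and the function names are hypothetical.

```python
def schooling_level(years):
    """Collapse years of schooling (interval) into ordered categories (ordinal)."""
    if years < 12:
        return "Less than High School"
    if years == 12:
        return "High School"
    if years <= 15:
        return "Some College"
    if years == 16:
        return "Bachelor's"
    return "Post Bachelor's"


def some_college(years):
    """Collapse years of schooling into a yes/no (dichotomous) variable."""
    return years > 12


for years in [9, 12, 14, 16, 17]:
    print(years, schooling_level(years), some_college(years))
# 9 Less than High School False
# 12 High School False
# 14 Some College True
# 16 Bachelor's True
# 17 Post Bachelor's True
```

The recode only works in one direction: the exact years of schooling cannot be recovered from the categories.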
For example: many analyses of Trump supporters will collapse education into college vs. non-college because (at least for whites) there’s a clear division between college grads and non-college grads:
Keep in mind that aggregation changes the unit of analysis. “Did you vote for Trump in 2020?” is dichotomous, but Trump’s share of the vote across an entire state is continuous:
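To make the change in unit of analysis concrete, here is a small Python sketch that turns individual yes/no records into state-level shares. The states “A” and “B” and every record are made up.

```python
from collections import defaultdict

# Hypothetical individual-level records: (state, voted_for_trump)
records = [
    ("A", True), ("A", False), ("A", False), ("A", True),
    ("B", True), ("B", True), ("B", False), ("B", True),
]

respondents = defaultdict(int)   # number of respondents per state
trump_votes = defaultdict(int)   # number of Trump voters per state

for state, voted_for_trump in records:
    respondents[state] += 1
    trump_votes[state] += int(voted_for_trump)

# One row per state instead of one row per person
trump_share = {state: trump_votes[state] / respondents[state] for state in respondents}

print(trump_share)  # {'A': 0.5, 'B': 0.75}
```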
The level of measurement will be a key constraint on our choices of descriptive statistics and graphs.
When we get to statistical modeling, variable types will matter for the kinds of models we can use.
Variable types will also matter for how data are stored and analyzed in R.
Use: visualizing the distribution of interval variables.
Divide the data into equal-width “bins” and count the number of values in each. The height of each bar indicates the number of values in that bin.
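The binning step can be sketched without any plotting library. In this Python example the `ages` values and the `histogram_counts` helper are hypothetical; real plotting tools choose bin edges with their own defaults.

```python
def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / n_bins
    counts = [0] * n_bins
    for value in values:
        if width == 0:
            index = 0  # every value is identical, so use the first bin
        else:
            # The maximum value goes in the last bin rather than a new one
            index = min(int((value - lowest) / width), n_bins - 1)
        counts[index] += 1
    return counts


ages = [18, 22, 25, 31, 34, 38, 45, 52, 58]  # hypothetical data

# Bins: 18-28, 28-38, 38-48, 48-58
print(histogram_counts(ages, 4))  # [3, 2, 2, 2]
```

Each count is the height of one bar in the histogram.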
Use: visualizing the distribution of interval variables
Sort of a “smoothed” version of the histogram. The area under the entire curve is one, and the height of the curve at a given point indicates how concentrated the data are in that region.
Use: visualizing the distribution of interval variables.
Shows the “five number summary” (minimum, 25th percentile, median, 75th percentile, maximum)
Especially useful for making comparisons across groups or describing multiple items with similar scales.
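The five numbers behind a box plot can be computed directly. This Python sketch uses `statistics.quantiles` with `method="inclusive"`; software packages use slightly different quartile conventions, so other tools may give slightly different results. The `scores` data are made up.

```python
from statistics import quantiles

scores = [1, 3, 3, 6, 7, 8, 9, 11, 12]  # hypothetical data

# n=4 splits the data into quarters and returns the three cut points
q1, q2, q3 = quantiles(scores, n=4, method="inclusive")

five_number = (min(scores), q1, q2, q3, max(scores))
print(five_number)  # (1, 3.0, 7.0, 9.0, 12)
```

The box spans `q1` to `q3`, and the line inside the box is the median `q2`.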
Use: visualizing the distribution of categorical variables
Count the frequency (or proportion) of observations in each group
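The counts behind a bar chart can be sketched in Python with `collections.Counter`. The `regions` list is invented for illustration.

```python
from collections import Counter

# Hypothetical nominal variable: region of each observation
regions = ["South", "West", "South", "Northeast",
           "Midwest", "South", "West", "Midwest"]

counts = Counter(regions)  # frequency of each category
proportions = {region: n / len(regions) for region, n in counts.items()}

print(counts["South"])       # 3
print(proportions["South"])  # 0.375
```

Either the counts or the proportions can be used as the bar heights.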
| | Nominal | Ordinal | Interval |
|---|---|---|---|
| Bar chart | ✅ | ✅ | ✅ |
| Histogram | ❌ | ❌ | ✅ |
| Density plot | ❌ | ❌ | ✅ |
| Box Plot | ❌ | ❌ | ✅ |
In addition to visualization, we generally want to be able to summarize and compare characteristics like:
Central Tendency: “typical values” of the variable
Dispersion: the amount of spread around the central tendency
Modality: the number of “peaks” or “modes” in a distribution.
Skewness: the amount of asymmetry in a variable.
(some things you probably remember from school)
Sum up all the numbers and divide by the total number of observations
\[ \bar{x} = \frac{1}{n}\sum^n_{i=1}x_i \]
\[ \bar{x} = \text{the mean of x} \]
\[ x_i = \text{the individual values of x} \]
\[ n = \text{the number of observations} \]
A useful feature of the mean: the residuals \(x_i - \bar{x}\) always sum to zero.
| \(x\) | \(x-\bar{x}\) |
|---|---|
| 3 | \(3 - 6 = -3\) |
| 4 | \(4 - 6 = -2\) |
| 6 | \(6 - 6 = 0\) |
| 11 | \(11 - 6 = 5\) |
| Total: \(24\), Mean: \(24/4 = 6\) | Total: \(-3 + -2 + 0 + 5 = 0\) |
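The same check takes a few lines of Python, using the four values from the table:

```python
x = [3, 4, 6, 11]

# Mean: sum of the values divided by the number of observations
mean_x = sum(x) / len(x)

# Residual: each value minus the mean
residuals = [x_i - mean_x for x_i in x]

print(mean_x)          # 6.0
print(residuals)       # [-3.0, -2.0, 0.0, 5.0]
print(sum(residuals))  # 0.0
```

With messier decimal data, floating-point rounding can leave the sum at a tiny nonzero number rather than exactly zero.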
Using the mean to predict an outcome means that we’ll sometimes predict values that are too high or too low, but over time the total sum of those errors will equal zero.
A problematic feature of the mean is that it’s sensitive to extreme values (outliers), which tend to show up in skewed distributions.
The average height of Victor Wembanyama (7 ft 4) and a bunch of regular people is probably misleading.
For an odd number of observations, the median is the middle number:
\[ x = 1, 3, 3, 6, 7, 8, 9 \]
\[ \text{Median} = 6 \]
For an even number of observations, the median is the mean of the two middle values:
\[ x = 1, 3, 3, 6, 7, 8, 9, 11 \]
\[ \text{Median} = \frac{6 + 7}{2} = 6.5 \]
Importantly, the median is a skew-robust measure of central tendency.
The mean and median will be similar if there’s no skew:
\[ x = 1, 3, 3, 6, 7, 8, 9, 11 \]
\[ \text{Median of x} = 6.5 \]
\[ \text{Mean of x} = 6 \]
But they diverge when we include extreme outliers:
\[ x = 1, 3, 3, 6, 7, 8, 9, 100000000000 \]
\[ \text{Median of x} = 6.5 \]
\[ \text{Mean of x} \approx 12500000005 \]
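The comparison is easy to reproduce with Python’s `statistics` module, using the two sets of values above:

```python
from statistics import mean, median

x = [1, 3, 3, 6, 7, 8, 9, 11]
x_outlier = [1, 3, 3, 6, 7, 8, 9, 100_000_000_000]

print(median(x), mean(x))
# 6.5 6

print(median(x_outlier), mean(x_outlier))
# 6.5 12500000004.625
```

Replacing a single value moves the mean by billions, while the median stays at 6.5.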
The modal value is the value that occurs most often.
\[ x = 1, 3, 3, 6, 7, 8, 9, 11 \]
\[ \text{Mode of x} = 3 \]
Unlike the mean, the mode is a valid measure of central tendency for nominal variables:
\[ \text{Tom, Earl, Tom, Sarah, Beth} \]
\[ \text{Mode = Tom} \]
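Both examples can be checked in Python with `statistics.mode`, which works on text as well as numbers. `statistics.multimode` (Python 3.8+) returns every value tied for most common; the `bimodal` list is made up to show that case.

```python
from statistics import mode, multimode

x = [1, 3, 3, 6, 7, 8, 9, 11]
names = ["Tom", "Earl", "Tom", "Sarah", "Beth"]

print(mode(x))      # 3
print(mode(names))  # Tom

# Hypothetical data with two equally common values
bimodal = [1, 1, 2, 2, 3]
print(multimode(bimodal))  # [1, 2]
```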
Variables may have more than one modal value. For instance, the cross-national distribution of male average years of schooling is roughly bimodal, mainly because different countries have different compulsory schooling requirements.
By contrast, the % of a country’s population that is working-age is unimodal: most countries fall between 65% and 70%, and no other range is nearly as common.
| | Nominal | Ordinal | Interval |
|---|---|---|---|
| Mean | ❌ | ❌ | ✅ |
| Median | ❌ | ✅ | ✅ |
| Mode | ✅ | ✅ | ✅ |
The standard deviation is a … standard measure of dispersion for interval variables based on squared deviations from the mean. A larger standard deviation, all else equal, indicates that observations tend to deviate from the mean more.
To calculate the standard deviation for a sample:
1. Calculate \(\bar{x}\) (the mean of \(x\))
2. Calculate the residual (\(x_i - \bar{x}\)) for each value
3. Square each residual and add them up to get the total sum of squares (TSS)
4. Calculate the variance by dividing the TSS by the number of observations minus 1 (\(n - 1\))
5. Calculate the standard deviation by taking the square root of the variance
| x | Deviation from mean (5) | Differences squared |
|---|---|---|
| 2 | -3 | 9 |
| 4 | -1 | 1 |
| 4 | -1 | 1 |
| 4 | -1 | 1 |
| 5 | 0 | 0 |
| 5 | 0 | 0 |
| 5 | 0 | 0 |
| 7 | 2 | 4 |
| 9 | 4 | 16 |
| Mean = 5 | Total = 0 | TSS = 32 |
\[ \text{Var}(x) = \frac{32}{9-1} = 4 \]
\[ s_x = \sqrt{4} = 2 \]
Fortunately, we don’t have to do this by hand:
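As a sketch in Python (R’s built-in `sd()` plays the same role), the steps can be written out once and then checked against the standard library’s `statistics.stdev`, which also divides by \(n - 1\). The values are the nine from the table above.

```python
from math import sqrt
from statistics import stdev

x = [2, 4, 4, 4, 5, 5, 5, 7, 9]

mean_x = sum(x) / len(x)                       # step 1: the mean
residuals = [x_i - mean_x for x_i in x]        # step 2: residuals
tss = sum(r ** 2 for r in residuals)           # step 3: total sum of squares
variance = tss / (len(x) - 1)                  # step 4: sample variance
sd = sqrt(variance)                            # step 5: standard deviation

print(mean_x, tss, variance, sd)  # 5.0 32.0 4.0 2.0

# The built-in function gives the same answer in one call
print(stdev(x))  # 2.0
```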
The key thing to remember is just that the standard deviation is sort of like “an average of differences from the average”
Range is simply the difference between the lowest and highest value
Interquartile Range (IQR) is the difference between the 75th and 25th percentiles (the third and first quartiles) of a variable
(which corresponds to the box part of a box-and-whiskers plot)
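Both measures can be sketched in Python. The data are the nine values from the standard deviation example; the quartiles use `statistics.quantiles` with `method="inclusive"`, and other quartile conventions can give slightly different IQRs.

```python
from statistics import quantiles

x = [2, 4, 4, 4, 5, 5, 5, 7, 9]

# Range: highest value minus lowest value
value_range = max(x) - min(x)

# IQR: 75th percentile minus 25th percentile
q1, _, q3 = quantiles(x, n=4, method="inclusive")
iqr = q3 - q1

print(value_range)  # 7
print(q1, q3, iqr)  # 4.0 5.0 1.0
```

The range depends entirely on the two most extreme values, while the IQR ignores them.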
| | Nominal | Ordinal | Interval |
|---|---|---|---|
| Standard Deviation | ❌ | ❌ | ✅ |
| IQR | ❌ | ✅ | ✅ |
Skew refers to the degree of asymmetry in data.
When the distribution is basically symmetric, the mean and the median essentially overlap.
With right skew, extreme high values pull the mean higher than the median.
Pilot design questions due Feb 7.
Homework 1 due Feb 13.